WIP: exploratory branch for fast array views #5003
Conversation
We used to have a `Buffer` type.
I do recall. Then we decided tuple, buffer and array were just too many vector-like types to explain to people. I'm thinking more of something that you use to implement other things, not a "user-facing" data type, if you see what I mean.
There was also the performance issue of having all the metadata associated with arrays in a composite type, IIRC. That may no longer be the case with immutable types.
Yes, it might be more reasonable now. But in any case, it's not necessary for this. The thing that made me think of it is that this currently is written specifically to be a view of a vector, not a multidimensional array. I could make the type more general, but then I'd have to add a gratuitous type parameter for the parent array, which is annoying. I could also construct views of multidimensional arrays by reshaping them to 1-d and then taking a view of that, but that's a bit weird.
LLVM does that automatically on x86 :-) See the native instructions here: c980bae – only div, no mod.
Nice. I'll have to brush up on my assembly to really understand what's happening, but nice!
Surprisingly, this branch doesn't seem to fix the performance problem, even though I can verify that the assembly is shorter. On this branch:
On master:
I've updated that gist to make sure that any differences in inlining (which seems to be prevented by assigning the output to a variable) are compensated. The assembly by gcc is here.
Oh, and the profile suggests that the …
Hmm. Maybe I'll have to ponder hard how we might be able to get rid of the div. It's a really hard problem, though. I'm not sure it's even possible.
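For intuition on where that div comes from, here is a sketch (in Python, with hypothetical names; not code from this branch) of the arithmetic that linear indexing into a strided 2-d view requires. Python's `divmod` returns quotient and remainder together, much as x86's `idiv` produces both in a single instruction:

```python
# Sketch: why linear indexing into a strided 2-d view needs integer
# division. Names are hypothetical, not Julia Base or this branch's API.

def view_offset_2d(linear_index, view_shape, view_strides, first):
    """Map a 1-based linear index into a 2-d view to an offset in the
    parent vector. divmod yields quotient and remainder together,
    mirroring how x86 idiv produces both in a single instruction."""
    m, _n = view_shape        # dimensions of the view (column-major)
    s1, s2 = view_strides     # strides (in elements) into the parent
    q, r = divmod(linear_index - 1, m)  # the unavoidable div
    return first + r * s1 + q * s2

# A 3x4 view starting at parent offset 10 with strides (1, 7):
# linear index 1 stays at 10; index 4 starts the second column at 17.
print(view_offset_2d(1, (3, 4), (1, 7), 10))  # 10
print(view_offset_2d(4, (3, 4), (1, 7), 10))  # 17
```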
Here's a perhaps-interesting observation: with the …
@timholy I suspect that was with …
You're right that the gist did not include that optimization, but I noticed that too and tested it locally. As we noted before, it definitely makes a difference, but only about 10%. Performance-wise, there's still a 3-4x gap between non-inlined performance with 2d arrays vs. subarrays in Julia. The two differences you noted (nice detective work, thanks!) seem like potential explanations of the performance gap. One test might be to compare to Cartesian indexing: …
So there's a 2.25-fold gap for Cartesian indexing. Cartesian indexing has the same issues you point out for linear indexing, but the gap is smaller. The tenfold gap for linear indexing, by comparison to Arrays, will hopefully be mostly fixed by @vtjnash's inlining work? One last thing to note is the puzzling allocation of memory for linear-indexing SubArrays.
One simple observation, though: I get the feeling that most uses of linear indexing involve simple iteration over the array (view) elements. It should be a lot simpler to do efficient linear iteration as compared to linear random-access indexing.
I have felt for a while that we need a different idiom for iterating over arrays than linear indexing. Linear indexing only makes sense for that application for simple dense arrays.
I agree too; of course, a natural alternative is Cartesian indexing. Basically, Cartesian.jl was born in part to handle this issue with the performance of SubArray indexing. (It's more than that, but it was a major motivation.) Hence #1917 will almost certainly quickly become important in any discussion of alternatives to linear indexing.
+1 I'm very excited about this. This is half of what I do all day.
- only supports Vector{T} as the data array
- multidimensional indexing via an arbitrary linear transform
- linear indexing compatible with the multidimensional indexing
I'm not even sure why we were generating code to check for a zero denominator – this seems to still raise an exception as expected. The other special case just isn't worth emitting more than one instruction for div – and div/rem, since x86 computes both together.
This version lets you see how tight the code gen can be without the gratuitous (now-removed) checks from the div intrinsic and the undef checks from loading the fields from the ArrayView structure:

```
julia> code_native(view2data,(Int,(Int,Int),(Int,Int)))
	.section	__TEXT,__text,regular,pure_instructions
Filename: /Users/stefan/projects/julia/base/array/view.jl
Source line: 17
	push	RBP
	mov	RBP, RSP
Source line: 17
	mov	R8, QWORD PTR [RSI + 16]
	mov	RSI, QWORD PTR [RSI + 24]
	mov	RAX, QWORD PTR [RDX + 16]
	mov	RCX, QWORD PTR [RDX + 24]
	mov	RCX, QWORD PTR [RCX + 8]
	dec	RCX
	imul	RCX, QWORD PTR [RSI + 8]
	mov	RAX, QWORD PTR [RAX + 8]
	dec	RAX
	imul	RAX, QWORD PTR [R8 + 8]
	add	RAX, RDI
Source line: 19
	lea	RAX, QWORD PTR [RCX + RAX + 1]
	pop	RBP
	ret

julia> code_native(view2data,(Int,(Int,Int),(Int,Int),Int))
	.section	__TEXT,__text,regular,pure_instructions
Filename: /Users/stefan/projects/julia/base/array/view.jl
Source line: 23
	push	RBP
	mov	RBP, RSP
	mov	R8, RDX
Source line: 23
	lea	RAX, QWORD PTR [RCX - 1]
Source line: 25
	mov	RCX, QWORD PTR [RSI + 16]
	mov	R9, QWORD PTR [RSI + 24]
	cqo
	idiv	QWORD PTR [RCX + 8]
	mov	RCX, RDX
	mov	RDX, QWORD PTR [R8 + 16]
	mov	RSI, QWORD PTR [R8 + 24]
	imul	RCX, QWORD PTR [RDX + 8]
	add	RCX, RDI
	cqo
	idiv	QWORD PTR [R9 + 8]
	imul	RDX, QWORD PTR [RSI + 8]
Source line: 28
	lea	RAX, QWORD PTR [RDX + RCX + 1]
	pop	RBP
	ret
```
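As a plain-language gloss on the first listing above (the dec/imul/add sequence), here is a sketch in Python of the Cartesian-indexing arithmetic; the function name and calling convention are illustrative, not the branch's actual API:

```python
# Sketch of the arithmetic in the first listing above: Cartesian indexing
# into a view is only multiply-adds (dec/imul/add), with no division.
# Function name and conventions are illustrative, not the branch's API.

def view_offset(first, indexes, strides):
    """1-based Cartesian indexing: offset = first + sum((i - 1) * stride)."""
    off = first
    for i, s in zip(indexes, strides):
        off += (i - 1) * s    # dec, imul, add per dimension
    return off

print(view_offset(1, (2, 3), (1, 10)))  # 1 + 1*1 + 2*10 = 22
```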
Also support views into multidimensional arrays. This wasn't all that annoying, actually; I realized I could just add an `m` parameter.
```
julia> include(joinpath(JULIA_HOME,"../../base/array/view.jl"))
benchmark (generic function with 4 methods)

julia> results = benchmark();

julia> data = squeeze(mapslices(median,results[:,:,2:3]./results[:,:,1],2),2)
10x2 Array{Float64,2}:
  6.64002     2.99833
  6.69273    16.3886
  8.59361    29.9632
 10.4218     48.2432
 12.1854     65.8061
 13.2618   2357.6
 16.1612   2534.79
 17.3616   2367.67
 17.1718   2754.53
 20.8452   3254.62

julia> data./[1:size(data,1)]
10x2 Array{Float64,2}:
 6.64002    2.99833
 3.34636    8.19429
 2.86454    9.98773
 2.60545   12.0608
 2.43708   13.1612
 2.21029  392.934
 2.30875  362.113
 2.1702   295.958
 1.90797  306.059
 2.08452  325.462
```
unfortunately, this takes a damned long time to run because the multidimensional linear SubArray indexing is *soooo* slow.
@lindahua Just pinging you on this thread, in case you haven't seen it already.
Yes, that's about right. The improvement is good but I'm still a bit sad that I can't get it even closer to direct linear indexing. It's pretty damned hard to compete with just lea + fetch.
I do feel that we need to get array views as fast as array indexing. Perhaps we need to hack some support deep in the internals, but it would be worth it.
I'm not entirely sure it can be done, regardless of how internal you get. There's a certain amount of arithmetic that has to be done to turn a linear index into an offset into the original array, and there's a fair amount of non-linearity to it, so you can't just collapse the computations down all the way. But we may be able to get it within 2x if we're very careful.
Can't we get the offset with …
You can do that, but you still have to do more arithmetic. Like I said, the relationship between the input index and the offset into the original array is not a simple linear relationship.
In particular, the amount of work you need to do is proportional to the number of dimensions.
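To make that concrete, here is a hedged Python sketch of the linear-index-to-offset computation: one `divmod` per dimension, so the work grows with the number of dimensions (the helper name and conventions are hypothetical, not Base API):

```python
# Concrete version of the claim above: converting a linear index into an
# offset costs one divmod per dimension, so the work grows with ndims.
# Hypothetical helper, not Base API.

def linear_to_offset(linear_index, view_shape, view_strides, first):
    """1-based linear index into a strided view -> parent offset."""
    off = first
    rest = linear_index - 1
    for size, stride in zip(view_shape, view_strides):
        rest, sub = divmod(rest, size)   # one div per dimension
        off += sub * stride
    return off

# 2x3x4 view at offset 1 with strides (1, 5, 40) into the parent:
print(linear_to_offset(1, (2, 3, 4), (1, 5, 40), 1))  # 1
print(linear_to_offset(7, (2, 3, 4), (1, 5, 40), 1))  # 41
```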
Ok, those undef checks are really killing us. I just manually hoisted the lookups for those out of the benchmarking loop and got this:

```
julia> squeeze(mapslices(median,10^9*results./sizes,2),2)
10x2 Array{Float64,2}:
 1.93126     6.10509
 1.47899     6.98912
 1.40065    13.3511
 1.50102    56.747
 1.49355    78.5422
 1.52309    93.0085
 0.421136   33.0249
 0.247773   21.4358
 1.76813   131.094
 1.5015    175.582

julia> squeeze(mapslices(median,results[:,:,2]./results[:,:,1],2),2)
10-element Array{Float64,1}:
   3.15233
   4.73653
   9.32549
  38.1872
  52.2354
  61.1526
  78.704
  84.6919
  75.6112
 117.906
```

I didn't run the SubArray benchmarks since they're slow and won't have changed. This is pretty respectable performance. I think that with some more performance tweaking and appropriate optimizations, we can get this really close to direct linear indexing, but unless someone comes up with a brilliant trick, there's always going to be a cost to linear indexing into high-dimensional array views.
@StefanKarpinski, this is amazing work. With regards to what is ultimately possible, have you considered making more extensive comparisons against C? In particular, an analysis that I had done earlier seemed to suggest that, for C, the arithmetic (at least for 2d) was inconsequential compared with cache misses. Are you running these tests with arrays too big to fit into L1 cache? Of course, it's also possible that there's something devious going on and I misinterpreted the results of the C code. Finally, there's another major issue: whether …
@StefanKarpinski This is great work. Thanks for taking the lead here. My two cents: …
@timholy: trying out these techniques in C is a good idea, since it will give a good sense of the best performance we can possibly get. I worry that the C compiler might do things that wouldn't be valid for Julia's compilation to do, because it can perform static global analysis that we can't. @lindahua: I think you are absolutely right about contiguous views, and it does seem like there are a lot of situations where we can statically determine that a view should be contiguous. I'll have to think about that a bit more – definitely a huge performance win.
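As a rough illustration of how contiguity could be decided statically, here is a Python sketch that classifies a view by the kinds of its indexes; the classification rule below is an assumption for illustration only, not the actual logic of ArrayViews.jl or Base:

```python
# Rough sketch of deciding contiguity from index *kinds* alone; the rule
# below is an assumption for illustration, not ArrayViews.jl's actual logic.

def view_is_contiguous(index_kinds):
    """index_kinds: one tag per dimension, fastest-varying dimension first.
    'colon' = whole dimension, 'range' = unit-step range, 'int' = scalar.
    Contiguous iff the view takes full leading dimensions followed by at
    most one range, with any remaining dimensions fixed by scalars."""
    kinds = list(index_kinds)
    while kinds and kinds[-1] == 'int':   # trailing scalars pick a slab
        kinds.pop()
    if not kinds:
        return True
    last = kinds.pop()
    return all(k == 'colon' for k in kinds) and last in ('colon', 'range')

print(view_is_contiguous(('colon', 'int')))    # True:  a column a[:, i]
print(view_is_contiguous(('range', 'colon')))  # False: a[2:5, :]
```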
Although it might add additional challenges, I'd also like to explore …
I definitely agree that leaning so hard on linear indexing is just causing a problem that doesn't need to exist. Now that we have a proposed hierarchy of array-like things, we should try to figure out a better general protocol for iterating over them that will be efficient for a wide range of implementations. I don't think that faster tuples help so much if you mean using tuples as state, since it's not really the multiple pieces of state that we need but the multiple layers of control – we really need nested for loops, each of which has its own piece of state that it can update very simply. More pieces of state with a single control loop isn't any better, because the update step still ends up being very complex – and it still happens on each iteration.
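To illustrate the nested-loops point, here is a Python sketch: each loop level owns one offset that it updates with a single add, so iterating a strided 2-d view needs no division at all (a hypothetical helper, not a proposed iteration protocol):

```python
# Sketch of the nested-loops idea: each loop level owns one offset and
# updates it with a single add, so no division is needed anywhere.
# Hypothetical 2-d helper, not a proposed iteration protocol.

def iterate_view_2d(parent, first, shape, strides):
    """Yield the elements of a strided 2-d view in column-major order."""
    m, n = shape
    s1, s2 = strides
    col = first
    for _ in range(n):       # outer loop owns the column offset
        off = col
        for _ in range(m):   # inner loop bumps by one stride per element
            yield parent[off]
            off += s1
        col += s2

parent = list(range(100))
# 2x3 view at offset 10 with strides (1, 10): columns start at 10, 20, 30.
print(list(iterate_view_2d(parent, 10, (2, 3), (1, 10))))
# [10, 11, 20, 21, 30, 31]
```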
That's a good point, but I was hoping that using a local tuple would allow LLVM's scalar conversion and hoisting to figure out which pieces of the indexing calculations are constant for each loop. We could also end up with each loop variable being an integer, but the programmer seeing a single tuple as each index. We need a …
I agree that more flexible iteration is highly desirable. @StefanKarpinski, on linear indexing and C: depending on how much work it is for you, one way to find out would be to look at the assembly generated by the code in that gist I linked to. That's 2d only, but it would be a start.
Could someone clarify for me the state of the art regarding ArrayView, NumericExtensions.UnsafeVectorView, etc.? I am interested in present facilities and the near-term future. The context is that I am writing a new edition of Bates and Watts, 1988, "Nonlinear Regression Analysis and Its Applications". This will be a large rewrite under the working title of "Nonlinear Regression with R", but I now propose changing that to "Nonlinear Regression with R and Julia". For generality, the model-function evaluation for partially linear models (almost all nonlinear regression models have one or more parameters that occur linearly in the model) has a signature of

```julia
### Michaelis-Menten model for enzyme kinetics
function MicMenmmf(nlpars::StridedVector,x::StridedVector,
                   tg::StridedVector,MMD::StridedMatrix)
    x1 = x[1]
    denom = nlpars[1] + x1
    MMD[1,1] = -(tg[1] = x1/denom)/denom
end
```

because I am using ….

On thinking about it, I am probably better off using duck typing as advocated by Steven Johnson and not using a signature. I still need to decide how to call such a function. All the arrays involved in the evaluation are configured so that the array views passed to the function will be contiguous views. Right now I am leaning toward using the unsafe views from the NumericExtensions package for these, because these are contiguous. Should I expect to continue to use those, or will ArrayViews be preferred?
@dmbates off-topic, but I would avoid using the …. A much better solution is to combine limited and judicious usage of ….
Ok, let me make this more explicit. This stuff should be used with caution, but I don't think it is as risky as @vtjnash suggests. If one uses these views within a local scope (i.e. not passing them around) and carefully ensures that the indices are actually in bounds, there won't be a lot of risk. I introduced unsafe_views basically as a stopgap. For example, if you want to efficiently process columns one by one, ….
When the array views in Julia base become fast enough (which I am eagerly looking forward to), one can just do a search & replace to replace ….
They aren't properly gc-rooted, so it is unsafe to assume that the original array will stick around, and you have to assume that any code that could get called on the array anywhere (esp. code that is expecting out-of-bounds conditions to throw an error) does not have typos. As I said before, if you want your code to require manual memory management and tracking of pointers, then just be honest about it and use C. I generally have a strong dislike of creating stopgaps; I would rather see the effort put into writing correct code and enjoying the speed increases as the underlying issues get addressed.
I agree with your general opinions. However, I don't think using … is as risky as you suggest.

Admittedly, unsafe views are not a terribly elegant solution. I myself don't use them if there are other solutions with comparable performance (within 2x-3x) and productivity. However, when the performance gap is up to 5x-10x or even bigger, I don't see why I would stick to the recommended ways and have my program run 10 times slower. In the rare cases that I use unsafe_views in my code, I strictly follow the pattern below:

```julia
function bar(vec, ...)
    ....
end

function foo( a::Matrix, ... )
    for i = 1 : size(a, 2)
        col = unsafe_view(a, :, i);
        bar(col, ....)
    end
end
```

I never throw the unsafe views around, and the code context makes it clear that the indices are within bounds. I don't see how the use of unsafe views under such restricted conditions would do any harm. I can certainly implement all these in C, but it means that I also have to migrate the related functions (such as …).

Generally, I suggest writing code the safe way in the first version of the implementation, and using unsafe views to improve the performance of critical parts with caution (preferably under the guidance of a profiler).

In terms of efficient array views, that's a complex problem. People have been working on it for months, and still the progress is very limited. Therefore, such stopgaps are introduced to temporarily work around the problem. Sometimes we have to make tradeoffs from a practical standpoint.
The code gen for the core indexing computations – both multidimensional and linear – in c980bae makes me hopeful that we can get array views to be a very fast core component of the language. This approach is really making me wonder if we shouldn't have a simple single-dimensional `Vec` or `Buffer` type that underlies all arrays, in terms of which they can be defined. We could even hide it from sight for the most part, but it makes defining arrays in Julia much more doable.